Objects, Data & AI: Build thought process to develop Enterprise Applications
that seems more natural or creative and avoids repetition, random sampling is used to introduce variability. However, the output may then become too creative or wander off topic.
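The trade-off above can be sketched in a few lines. The sketch below is a minimal, hypothetical illustration of temperature-scaled random sampling over raw model scores (logits); the logit values and function name are invented for the example, not taken from any real model:

```python
import math
import random

def sample_token(logits, temperature=1.0, seed=None):
    """Randomly sample a token index from raw logits.

    temperature < 1.0 sharpens the distribution (safer, more repetitive);
    temperature > 1.0 flattens it (more creative, more risk of drifting
    off topic).
    """
    rng = random.Random(seed)
    scaled = [score / temperature for score in logits]
    # Softmax with max-subtraction for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one token index in proportion to its probability.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical logits for three candidate tokens; with a very low
# temperature, sampling collapses to the most likely token (index 0).
print(sample_token([5.0, 1.0, 0.0], temperature=0.01, seed=0))
```

With a high temperature the same call would pick indices 1 or 2 far more often, which is exactly the "too creative" behavior the text warns about.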
Say, in the first iteration, the model predicts the French word (token) “Je” (meaning “I” in English) with
the highest probability. The predicted token “Je” is then passed back into the decoder loop. This token,
along with the contextual representation from the encoder, guides the decoder to predict the next token. In
the subsequent iteration, the decoder predicts the next word (token) “T'aime” (meaning “Love You” in
English) based on the previously predicted “Je” and the contextual information. The decoder considers
both the input context and the tokens generated so far to make informed predictions.
The decoder continues this process, predicting tokens iteratively until it predicts an end-of-sequence token, which indicates that the translation is complete. The sequence of predicted tokens, which includes “Je” and “T'aime”, is then detokenized into human-readable text. In this case, the translation output is “Je T'aime”, which means “I Love You” in French.
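The decoder loop just described can be sketched as a toy program. The "model" here is a hypothetical lookup table standing in for a real Transformer decoder: given the tokens generated so far, it returns the most likely next token, and decoding stops at the end-of-sequence marker:

```python
# Stand-in for a trained decoder: maps the tokens generated so far
# to the highest-probability next token (purely illustrative).
NEXT_TOKEN = {
    ("<bos>",): "Je",
    ("<bos>", "Je"): "T'aime",
    ("<bos>", "Je", "T'aime"): "<eos>",
}

def greedy_decode(start="<bos>", end="<eos>", max_steps=10):
    tokens = [start]
    for _ in range(max_steps):
        # Predict the next token from everything generated so far.
        next_tok = NEXT_TOKEN[tuple(tokens)]
        if next_tok == end:  # end-of-sequence: translation is complete
            break
        tokens.append(next_tok)
    # Detokenize: join the predicted tokens into human-readable text.
    return " ".join(tokens[1:])

print(greedy_decode())  # prints: Je T'aime
```

A real decoder replaces the lookup table with a neural network conditioned on the encoder's contextual representation, but the loop structure, feeding each prediction back in until end-of-sequence, is the same.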
The original Transformer model had only 6 encoder and 6 decoder layers, whereas more recent models stack dozens or even hundreds of layers. This allows the models to learn more complex patterns in the data, but it also makes them harder to train. Transformers are the foundation of present-day Large Language Models, and the evolution of transformer architectures has led to significant improvements in language-model performance. Today, transformers are the state-of-the-art approach for a wide range of NLP tasks, including machine translation, text summarization, question answering, and natural language inference. Their ability to understand the context of words and generate coherent responses makes them well-suited for these tasks.
A year later, in 2018, Google introduced BERT, a model based on the transformer architecture that brought a breakthrough in NLP by introducing bidirectional pretraining. Prior to BERT, most language models were
trained to process text in one direction only, either left-to-right or right-to-left. This made it difficult for
them to capture the full context of a sentence, as they could only see the words that came before or after
the current word.
BERT considers both directions, capturing richer contextual information. This significantly improved the
quality of contextual embeddings, benefiting tasks like language understanding, sentiment analysis, and
more. For example, BERT achieved state-of-the-art results on eleven NLP tasks, including strong gains on the GLUE benchmark, a suite of natural language understanding tasks. These tasks include question answering, natural language inference, and sentiment analysis.
BERT's bidirectional pretraining enables the model to learn the contextual relationships between words that
appear in different parts of a sentence. This is essential for disambiguating words with multiple meanings
and for understanding how words influence each other's interpretation within a given context. For example,
the word "bank" can refer to a financial institution or the side of a river, and BERT can learn which meaning
is more likely based on the surrounding words.
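The "bank" example can be made concrete with a deliberately simplified sketch. This is not how BERT works internally; it is a hypothetical cue-word scorer (the sense names and cue sets are invented) that mimics one property of bidirectionality: evidence for a word's meaning can come from anywhere in the sentence, before or after the ambiguous word:

```python
# Invented cue words for each sense of "bank" (illustrative only).
SENSE_CUES = {
    "financial institution": {"money", "deposit", "loan", "account"},
    "river side": {"river", "water", "fishing", "shore"},
}

def disambiguate(sentence):
    """Pick the sense of 'bank' whose cue words best match the sentence.

    Cues are matched anywhere in the sentence, left or right of the
    ambiguous word, loosely mimicking bidirectional context.
    """
    words = set(sentence.lower().split())
    return max(SENSE_CUES, key=lambda sense: len(SENSE_CUES[sense] & words))

print(disambiguate("she sat on the bank of the river fishing"))
print(disambiguate("he went to the bank to deposit money"))
```

Real BERT does this with learned attention weights over contextual embeddings rather than hand-written cue lists, which is why it generalizes to words and contexts it was never explicitly told about.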
One of the limitations of traditional models that process text sequentially, such as RNNs, is their inability
to capture long-range dependencies effectively. While RNNs can be used for generative AI tasks, their sequential processing is costly in compute and memory, making it hard to retain context in longer texts. The transformer architecture is more parallelizable, and its attention mechanism helps capture long-range dependencies in the input. BERT overcomes this limitation by attending to words that are not adjacent to each other in the sentence. For instance, BERT can learn that words like "love" and "hate" are often
associated, even if they appear in separate sentences. This ability to capture relationships between distant
words is crucial for understanding the overall sentiment, tone, and meaning of a piece of text.
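The structural difference between a left-to-right model and a bidirectional one comes down to the attention mask: which positions each token is allowed to look at. The sketch below (using NumPy; the function name is invented for illustration) shows that under a causal mask an early token cannot attend to later tokens, while a BERT-style bidirectional mask allows every position to attend to every other:

```python
import numpy as np

def attention_mask(n, bidirectional):
    """Boolean mask: entry [i, j] is True if position i may attend to j."""
    if bidirectional:
        # BERT-style: every token sees every other token.
        return np.ones((n, n), dtype=bool)
    # Causal (left-to-right): position i sees only positions j <= i.
    return np.tril(np.ones((n, n), dtype=bool))

n = 5
bi = attention_mask(n, bidirectional=True)
causal = attention_mask(n, bidirectional=False)
# Can the first token attend to the last one?
print(bi[0, n - 1], causal[0, n - 1])  # prints: True False
```

In a full model these masks gate the attention scores before the softmax, so the mask directly determines whether distant right-hand context, like a sentiment-bearing word later in the text, can influence an earlier token's representation.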